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Abstract 

High computational costs of manifold learn¬ 
ing prohibit its application for large point 
sets. A common strategy to overcome this 
problem is to perform dimensionality reduc¬ 
tion on selected landmarks and to succes¬ 
sively embed the entire dataset with the 
Nystrom method. The two main challenges 
that arise are: (i) the landmarks selected in 
non-Euclidean geometries must result in a 
low reconstruction error, (ii) the graph con¬ 
structed from sparsely sampled landmarks 
must approximate the manifold well. We 
propose the sampling of landmarks from de¬ 
terminantal distributions on non-Euclidean 
spaces. Since current determinantal sam¬ 
pling algorithms have the same complexity as 
those for manifold learning, we present an ef¬ 
ficient approximation running in linear time. 
Further, we recover the local geometry after 
the sparsification by assigning each landmark 
a local covariance matrix, estimated from the 
original point set. The resulting neighbor¬ 
hood selection based on the Bhattacharyya 
distance improves the embedding of sparsely 
sampled manifolds. Our experiments show 
a significant performance improvement com¬ 
pared to state-of-the-art landmark selection 
techniques. 

1 Introduction 

Spectral methods are central for a multitude of appli¬ 
cations in machine learning, statistics, and computer 
vision, such as dimensionality reduction [HUM], clas¬ 
sification [55] [53], and segmentation [55] 52 ■ A lim¬ 
iting factor for the spectral analysis is the computa¬ 
tional cost of the eigen decomposition. To overcome 
this limitation, the Nystrom method (36] is commonly 
applied to approximate the spectral decomposition of 
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the Gramian matrix. A subset of rows/columns is se¬ 
lected and based on the eigen decomposition of the 
resulting small sub-matrix, the spectrum of the origi¬ 
nal matrix can be approximated. While the Nystrom 
extension is the standard method for the matrix re¬ 
construction, the crucial part is the subset selection. 
In early work [36] . uniform sampling without replace¬ 
ment was proposed. This was followed by numerous 
alternatives including K-means clustering [57] . greedy 
approaches EUHl , and volume sampling mm a 
recent comparison of several approaches is presented 
in [20] - 

Of particular interest for subset selection is vol¬ 
ume sampling m, equivalent to determinantal sam¬ 
pling because reconstruction error bounds exist. 
This method is, however, not used in practice because 
of the high computational complexity of sampling from 
the underlying distributions SO]. Independently, de¬ 
terminantal point processes (DPPs) have been pro¬ 
posed recently for tracking and pose estimation [18] . 
They were originally designed to model the repulsive 
interaction between particles. DPPs are well suited 
for modeling diversity in a point set. A sampling al¬ 
gorithm for DPPs was presented in gum], which has 
complexity 0(n 3 ) for n points. Since this algorithm 
has the same complexity as the spectral analysis, it 
cannot be directly used as a subset selection scheme. 

In this paper, we focus on nonlinear dimensional¬ 
ity reduction for large datasets via manifold learning. 
Popular manifold learning techniques include kernel 
PCA [57] . Isomap [3D], and Laplacian eigenmaps [5. 
All of these methods are based on a kernel matrix of 
size 0(n 2 ) that contains the information about the 
pairwise relationships between the input points. The 
spectral decomposition of the kernel matrix leads to 
the low-dimensional embedding of the points. For 
large n, one is interested in avoiding its explicit cal¬ 
culation and storage. In contrast to general rank-A: 
matrix approximation, this is possible by taking the 
nature of the non-linear dimensionality reduction into 
account and relating the entries of the kernel matrix 
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directly to the original point set. 


the kernel matrix K such that 


Consequently, we propose to perform DPP sampling 
on the original point set to extract a diverse set of land¬ 
marks. Since the input points lie in a non-Euclidean 
space, ignoring the underlying geometry leads to poor 
results. To account for the non-Euclidean geometry 
of the input space, we replace the Euclidean distance 
with the geodesic distance along the manifold, which 
is approximated by the graph shortest path distance. 
Due to the high complexity of DPP sampling, we de¬ 
rive an efficient approximation that runs in 0(ndk), 
with input dimensionality d and subset cardinality k. 
The algorithm restricts the updates to be local, which 
enables sampling on complex geometries. This, to¬ 
gether with its low computational complexity, makes 
the algorithm well suited for the subset selection in 
large scale manifold learning. 

A consequence of the landmark selection is that the 
manifold is less densely sampled than before, making 
its approximation with neighborhood graphs more dif¬ 
ficult. It was noted in [3], as a critical response to [50] , 
that the approximation of manifolds with graphs is 
topologically unstable. In order to improve the graph 
construction, we retain the local geometry around each 
landmark by locally estimating the covariance matrix 
on the original point set. This allows us to compare 
multivariate Gaussian distributions with the Bhat- 
tacharyya distance for neighborhood selection, yield¬ 
ing improved embeddings. A shorter version of this 
work was published in [35]. 

2 Background 


We assume n points in high dimensional space 
xi,...,x n £ and let A' £ be the matrix 

whose *-th column is the point Xi . Non-linear di¬ 
mensionality reduction techniques are based on a pos¬ 
itive semidefinite kernel K , with a typical choice of 
Gaussian or heat kernel K it j = exp(— ||a;, — Xj\\ 2 /2a 2 ). 
The resulting kernel matrix is of size 0(n 2 ). Neces¬ 
sary for spectral analysis is the eigen decomposition of 
the kernel matrix, which has complexity 0(n 3 ). For 
most techniques, it is only necessary to compute the 
leading k eigenvectors. The problem can therefore 
also be considered as finding the best rank-fc approx¬ 
imation of the matrix AT, with the optimal solution 
Kk = Y^i =l ^i u i u J i where A i is the i-th largest eigen¬ 
value and Ui is the corresponding eigenvector. 


2.1 Nystrom Method 

Suppose J C {1, ..., n} is a subset of the original point 
set of size k and J is its complement. We can reorder 


K = 


Kjxj 

Kj x j 

, K = 

'Kj x j 

Kjxj 

_ K JXJ 

Kj x j 

Kjx J 

KjxjKJxjKjxjj 


with K being the matrix estimated via the Nystrom 
method [36]. The Nystrom extension leads to the 
approximation Kj x j « Kj x jKJ x jK Jx j. The ma¬ 
trix inverse is replaced by the Moore-Penrose gen¬ 
eralized inverse in case of rank deficiency. The 
Nystrom method leads to the minimal kernel comple¬ 
tion [4] conditioned on the selected landmarks and has 
been reported to perform well in numerous applica¬ 
tions The challenge lies in finding 

landmarks that minimize the reconstruction error 

11* - ^lltr = tr (Kj- xJ ) - tr(K] xJ K^jKj x j). (1) 

The trace norm ||.||t r is applied because results only 
depend on the spectrum due to its unitary invariance. 


2.2 Annealed Determinantal Sampling 

A large variety of methods have been proposed for 
selecting the subset J. For general matrix approx¬ 
imation, this step is referred to as row/column se¬ 
lection of the matrix K, which is equivalent to se¬ 
lecting a subset of points X. This property is im¬ 
portant because it avoids explicit computation of the 
0(n 2 ) entries in the kernel matrix K. We focus on 
volume sampling for subset selection because of its 
theoretical advantages nu. We employ the factor¬ 
ization Kj x j = YjYj , which exists because Kj is 
positive semidefinite. Based on this factorization, the 
volume Vol({Ti},; e j) of the simplex spanned by the ori¬ 
gin and the selected points Yj is calculated, which is 
equivalent to the volume of the parallelepiped spanned 
by Yj. The subset J is then sampled proportional 
to the squared volume. This is directly related to 
the calculation of the determinant with det(A'j x j) = 
det(Y J r lj) = det(Fj) 2 = Vol 2 ({Fi} ie j). These ideas 
were further generalized in [4] based on annealed de¬ 
terminantal distributions 


P S {J ) oc det (Kj xj ) s = detCF/F/) 8 = det(Yj) 2s . (2) 


This distribution is well defined because the princi¬ 
pal submatrices of a positive semidefinite matrix are 
themselves positive semidefinite. Varying the expo¬ 
nent s > 0 results in a family of distributions, model¬ 
ing the annealing behavior as used in stochastic com¬ 
putations. For s = 0 this is equivalent to uniform 
sampling [36] . In the following derivations, we focus on 
s = 1. It was shown in m that for J ~ p{ J), | J| = k 


E 


\K-K\\ tI <(k + l)\\K-K k \\ 2 F , 


(3) 
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where AT is the Nystrom reconstruction of the kernel 
based on the subset J, K k the best rank-A: approxi¬ 
mation achieved by selecting the largest eigenvectors, 
and ||.||ir the Frobenius norm. It was further shown 
that the factor fc + 1 is the best possible for a fc-subset. 
Related bounds were presented in [5]. A further result 
is that for any matrix A', there exist k + k{k + l)/e 
rows whose span contains the rows of a rank-fc matrix 
K k such that 

\\K-K k \\ 2 F <(l+e)\\K-K k \\ 2 F . (4) 


that the inner part of the roll is almost entirely ne¬ 
glected, as a consequence of not taking the manifold 
structure into account. A common solution is to use 
geodesic distances |3U] . which can be approximated 
by the graph shortest path algorithm. Consequently, 
we replace the Euclidean distance ||.|| in the construc¬ 
tion of the kernel matrix K with the geodesic distance 


K, 


= ex P ( — 11a;* - abl|g eo 


/2a 2 ). The result is shown 


in Fig. 1(c) We observe a clear improvement in the 
diversity of the sampling, now also including points in 
the interior part of the Swiss roll. 


This result establishes a connection between matrix 
approximation and projective clustering, with the se¬ 
lection of a subsets columns being similar to the con¬ 
struction of a coreset mug. 


3 Method 

In the following, we will first analyze the sampling 
from determinantal distributions on non-Euclidean ge¬ 
ometries. Subsequently, we introduce an efficient al¬ 
gorithm for approximate DPP sampling on manifolds. 
Finally, we present our approach for robust graph con¬ 
struction on sparsely sampled manifolds. 


3.1 DPP Sampling on Manifolds 

As described in Section |2.2[ sampling from determi¬ 
nantal distributions is used for row/column selection. 
Independently, determinantal point processes (DPPs) 
were introduced for modeling probabilistic mutual ex¬ 
clusion [22] . They present an attractive scheme for 
ensuring diversity in the selected subset. An interest¬ 
ing construction of DPPs is based on L-ensembles [5] . 
Given a positive semidefinite matrix L £ M" xn , the 
likelihood for selecting the subset J C {1,..., n} is 


Pl(J) 


det (L jxj ) 
det(A + 1)' 


(5) 


where / is the identity matrix and Lj x j is the sub¬ 
matrix of L containing the rows and columns indexed 
by J. Identifying the L-ensemble matrix L with the 
kernel matrix AT, we can apply DPPs to sample subsets 
from the point set X. 


To date, applications using determinantal point pro¬ 
cesses have assumed Euclidean geometry mm- F ° r 
non-linear dimensionality reduction, we assume that 
the data points lie on non-Euclidean spaces, such as 
the Swiss roll in Fig. 1(a) To evaluate the performance 
of DPPs on manifolds, we sample from the Swiss roll. 
Since we know the construction rule in this case, we 
can invert it and display the sampled 3D points in the 
underlying 2D space. The result in Fig. |l(b)| shows 


3.2 Efficient Approximation of DPP 
Sampling on Manifolds 

We have seen in the last sections that it is possi¬ 
ble to adapt determinantal sampling to non-Euclidean 
geometries and that error characterizations for the 
subset selection exist. However, we are missing an 
efficient sampling algorithm for dealing with large 
point sets. In HU, an approximative sampling based 
on the Markov chain Monte Carlo method is pro¬ 
posed to circumvent the combinatorial problem with 
(/) possible subsets. Further approximations include 
sampling proportional to the diagonal elements Ka 
or its squared version K/, leading to additive error 
bounds mm- In [15], an algorithm is proposed that 
yields a k\ approximation to volume sampling, wors¬ 
ening the approximation from (k + 1) to (k + 1)!. 


Algorithm 1 DPP sampling equivalent to [lSj 

Require: Eigen decomposition of AT: {(i^, Ai)}” =1 
1: for i = 1 to n do 

2: Add eigenvector Vi with probability j^-j- to V 

3: end for 
4: B = V T 
5: for 1 to \ V\ do 

6: Select i £ 1... n with probability P(i) oc ||Rj|| 2 

7: J £- J U i 

8: Bj £- Pro) ± b .Bj for all j £ {1, ..., n} 

9: end for 
10: return J 


An exact sampling algorithm for DPPs was presented 
in mm, which requires the eigen decomposition of 
K = A iV-ivJ. We state an equivalent reformula¬ 

tion of this sampling algorithm in Algorithm [l] First, 
eigenvectors are selected proportional to the magni¬ 
tude of their eigenvalues and stored as columns in V. 
Assuming m vectors are selected then V £ R™ xm . By 
setting B = V T , we denote the rows of V as Bi £ M m . 
In each iteration, we select one of the n points where 
point i is selected proportional to the squared norm 
11 11 2 . The point is added to the subset J. After the 
selection of i, all vectors Bj are projected to the or- 













Sampling from DPPs for Scalable Manifold Learning 


JS 




~y 


. 'C 

S' • 




■\:V 
' -K-.TA: 

(a) Swiss Roll 


• *!s 
**»r 


(b) Standard 



(c) Geodesic 


(d) Efficient 


Figure 1: DPP sampling from 1,000 points lying on a manifold. We show results for standard DPP sampling, 
geodesic DPP sampling, and efficient DPP sampling. Note that the sampling is performed in 3D, but we can 
plot the underlying 2D manifold by reversing the construction of the Swiss roll. Geodesic and efficient sampling 
yields a diverse subset from the manifold. 


thogonal space of B t . Since Proj ±B .-Bj = 0, the same 
point is almost surely not selected twice. The update 
formulation differs from jTS], where an orthonormal 
basis of the eigenvectors in V to the i-th basis vector 
ei G R ra is constructed. Both formulations are equiva¬ 
lent but provide a different point of view on the algo¬ 
rithm. The modification is, however, essential for the 
justification of the proposed efficient sampling proce¬ 
dure. The following proposition characterizes the be¬ 
havior of the update in the algorithm. 

Proposition 1. Let Bi,Bj G R m \ {0} be two non¬ 
zero vectors in R m , and 9 = Z(Bi,Bj ) be the angle 
between them. Then 

WProj^Bif = Ill'll 2 sin2 e > ( 6 ) 

where Projj_ B .Bj is the the projection of Bj on the 
subspace perpendicular to Bi. For Bi ^ 0 and Bj = 0 
the projection is \\Proj ±B Bj \\ 2 = 0. 

We state the proof in the supplementary material. 

Sampling from a determinantal distribution is not only 
advantageous because of the presented error bounds 
but it also makes intuitive sense that a selection of 
a diverse set of points yields an accurate matrix re¬ 
construction. The computational complexity of Algo¬ 
rithm [T] is, however, similar or even higher than for 
manifold learning because the spectral decomposition 
of a dense graph is required, whereas Laplacian eigen- 
maps operate on sparse matrices. An approach for ef¬ 
ficient sampling proposed in m works with the dual 
representation of K = Y T Y to obtain Q = YY' , with 
Q being hopefully smaller than the matrix K. Consid¬ 
ering that we work with a Gaussian kernel matrix, this 
factorization corresponds to the inner product in fea¬ 
ture space (f>(xi) T <f>(xj) of the original points x*,Xj. It 
is noted in the literature that the Gaussian kernel cor¬ 
responds to an infinite dimensional feature space [27| . 
Since we work with symmetric, positive definite matri¬ 


ces, we can calculate a Cholesky decomposition. How¬ 
ever, the dual representation has the same size as the 
original matrix and therefore yields no improvement. 

Algorithm 2 Efficient approximation of DPP sam¬ 
pling_ 

Require: Point set X, subset cardinality fc, nearest 
neighbors to, update function / 

1: Initialize D = l n 
2: for 1 to k do 

3: Select i G 1... n with probability p{i) oc Di 

4: J 4— J VJ i 

5: Calculate Aj = ||x, — xj\\, Vj G 1... n 

6: Set to nearest neighbors as neighborhood M t 

based on {A, }, -- i ... T/ 

7: Dj «— Dj ■ /(Aj), Vj G Ni 

8: Optional: Calculate covariance C, in local 

neighborhood J\fi around Xj 
9: end for 

10: return J and optionally {Ci}i e j 


To cope with the high computational costs of exact 
DPP sampling, we present an efficient approximation 
in Algorithm [2j The computational complexity is 
0{ndk). Vector D G R" models the probabilities for 
the selection of points as does ||i?i|| 2 in the original 
DPP sampling algorithm. The algorithm proceeds by 
sampling k points. Note that the cardinality of the 
subset cannot be set in the original DPP sampling al¬ 
gorithm, which led to the introduction of fc-DPPs m, 
At each iteration we select one point xy with probabil¬ 
ity p(i) oc Di. Next we calculate distances { Aj}j— i... n 
of the selected point Xi to all points in X. Based on 
these distances we identify a local neighborhood Mi of 
to nearest neighbors around the selected point x,;. The 
update of the probabilities D is restricted to the neigh¬ 
borhood Mi, which proves advantageous for sampling 
on manifolds. In contrast, Algorithm [l] updates prob¬ 
abilities for all points. If we are interested in achiev- 
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Figure 2: Comparison of sin 2 (a;) and (1 — exp(— x 2 )) 
used for / in Algorithm 2. 


ing a similar behavior to Algorithm [lj the local neigh¬ 
borhood should include all points. The update func¬ 
tion / takes distances A as input, where we consider 
/(A) = sin 2 (A/r) and /(A) = (1 - exp(-A 2 /2cr 2 )), 
as motivated below. In subsequent iterations of the 
algorithm, points close x- L will be selected with lower 
probability. 

We initialize the vector D = 1 „, since it was noted 
in m that the squared norm of the vectors ||-Bi|| 2 
gives rise initially to a fairly uniform distribution be¬ 
cause no points have yet been selected. For the update 
step, Proposition 1 implies that the update of the prob¬ 
abilities for selecting a point changes with sin 2 (0). The 
angle 0 = Z(Bi,Bj) correlates strongly with the dis¬ 
tance \\Bi — Bj\\, since ||Bj|| and ||.Bj|| are initially the 
same. The selection of the eigenvectors in Algorithm]!] 
will most likely choose the ones with the largest eigen¬ 
values. We can therefore draw the analogy to multi¬ 
dimensional scaling (AIDS) [55] with a Gaussian ker¬ 
nel, where AIDS selects the top eigenvectors. Con¬ 
sequently, vectors Bi correspond to low-dimensional 
embeddings produced by multidimensional scaling of 
original points Xj. The characteristic property of AIDS 
is to preserve pairwise distances between the origi¬ 
nal space and the embedding space, permitting the 
approximation \\Bi — Bj\\ « ||ar* — ajj|| . This allows 
us to approximate the update dependent on the an¬ 
gle 9 by the distance of points in the original space, 
sin 2 (0) « sin 2 (||xi — Xj||/r). The scaling factor r en¬ 
sures that values are in the range [—7r/2; 7t/2] . This 
update is actually very similar to the Welsch function 
(1 — exp(— \\xi — Xj\\ 2 /2a 2 )), which is directly related 
to the weights in the kernel matrix and is commonly 
used in machine learning. We illustrate the similar¬ 
ity of both functions in Fig. [2] and focus on the Welsch 
function in our experiments. For subsequent iterations 
of the algorithm, the assumption of a similar norm 
of all vectors Bi is violated, because the projection 



Figure 3: Two landmarks (crosses) selected in pre¬ 
vious steps and two candidates for the current step 
(discs). K-means-1—b selects both candidates equally 
likely since they have the same distance r to the land¬ 
mark (blue). The proposed algorithm more likely se¬ 
lects the green point because also the black cross in¬ 
fluences the landmark selection, yielding an increase 
in diversity. 


on the orthogonal space changes their lengths. Note, 
however, that this change is locally restricted around 
the currently selected point. Since this region is less 
likely to be sampled in the subsequent iterations, the 
assumption still holds for parts of the space that con¬ 
tain most probability. 

Remark: The proposed algorithm bears similarities 
to K-means-1—I- 2|. which replaces the initialization 
through uniform sampling of K-means by a new seed¬ 
ing algorithm. K-means++ seeding is a heuristic that 
samples points based on the minimal distance to pre¬ 
viously selected landmarks. Initially, when only one 
landmark is selected, our proposed algorithm has a 
nearly identical update rule for a maximal neighbor¬ 
hood Mi- In later iterations, the algorithms differ be¬ 
cause K-means++ bases the selection only on the dis¬ 
tance to the nearest landmark, while all landmarks 
influence the probability space in our algorithm. Con¬ 
sequently, our approach potentially yields subsets with 
higher diversity, as illustrated in the Fig. [3] 

3.3 Robust Landmark-based Graph 
Construction 

After selecting the landmarks, the next step in the 
spectral analysis consists of building a graph that ap¬ 
proximates the manifold. Common techniques for the 
graph construction include selection of nearest neigh¬ 
bors or e-balls around each node. Both approaches 
require the setting of a parameter, either the number 
of neighbors or the size of the ball, which is crucial 
for the performance. Setting the parameter too low 
leads to a large number of disconnected components, 
while for many applications one is interested in having 
all points connected to obtain a consistent embedding 
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of all points. Choosing too high values of the param¬ 
eters leads to short cuts, yielding a bad approxima¬ 
tion of the manifold. The appropriate selection of the 
parameters is more challenging on sparsely sampled 
manifolds. This is problematic for the subset selection 
with consecutive Nystrom reconstruction because we 
dramatically reduce the sampling rate to limit compu¬ 
tational complexity. 

To address this issue, we propose a new technique for 
graph construction that takes the initial distribution 
of the points into account. For each landmark a,’j, we 
estimate the covariance matrix C, around this point 
from its nearest neighbors M %, as indicated as optional 
step in Algorithm [2j This corresponds to multivariate 
Gaussian distributions Q(xi,Ci) centered at the land¬ 
mark. A commonly used distance to compare distri¬ 
butions is the Bhattacharyya distance, which in case 
of Gaussian distributions corresponds to 

B{Qi,Qj) = ^{xi-XjfC-'txi-x^+hn ^ 

(7) 

with C = 1 0 1 . This distance is less likely to produce 

short cuts across the manifold because points that fol¬ 
low the local geometry appear much closer than points 
that are off the geometry. Consequently, we replace 
the Euclidean distance for neighborhood selection in 
manifold learning with the Bhattacharyya distance, as 
schematically illustrated in Fig. [4] Space requirements 
of this step are 0(d 2 k). An alternative for limited 
space and large d is to only use the diagonal entries of 
the covariance matrix, requiring O(dk). We summa¬ 
rize all steps of the proposed scalable manifold learning 
in Algorithm [3j 


Algorithm 3 Summary of scalable manifold learning 


1: Select landmarks with approximate DPP sampling 
(Algorithm 0 

2: Construct neighborhood graph on landmarks with 
Bhattacharyya distance (Equ. 0 

3: Calculate low-dimensional embedding based on 
neighborhood graph 

4: Embed non-landmark points by performing out- 
of-sample extension with Nystrom method 


4 Experiments 

In our first experiment, we show that the proposed effi¬ 
cient DPP sampling algorithm is well suited for subset 
selection on non-Euclidean spaces. The benefit for this 
scenario is that we can restrict the update of the sam¬ 
pling probability D to a local neighborhood M% around 



Figure 4: Nearest neighbor selection with Bhat¬ 
tacharyya distance. Covariance matrices (ellipsoids) 
are estimated for landmarks based on the original 
point set. Points marked as O and X have equal 
Euclidean distance, easily leading to short cuts in 
the neighborhood graph. The Bhattacharyya distance 
’ considers the local geometry and is smaller for O than 
for X, better approximating the geodesic distance. 


the current point x. t . This is in line with the motiva¬ 
tion of many manifold learning algorithms, assuming 
that the space behaves locally like a Euclidean space. 
In our experiment, we set the local neighborhood Mi to 
be the 20 nearest neighbors around the selected point 
Xi. The sampling result is shown in Fig. |l(d)| We ob¬ 
tain a point set with high diversity, covering the entire 
manifold. This illustrates that the proposed algorithm 
preserves the DPP characteristic on complex geome¬ 
tries and is therefore appropriate for subset selection 
in the context of non-linear dimensionality reduction. 

In our second experiment, we quantify the recon¬ 
struction error for matrix completion as formulated 
in Equ. We compare the efficient DPP sampling 
result with uniform sampling fTFi and K-means clus¬ 
tering with uniform seeding EH, which performed best 
in several studies including a recent one [2II|. More¬ 
over, we compare to selecting the subset with the 
K-means++ seeding and the K-means++ algorithm, 
which we have not yet seen for landmark selection. We 
construct a Gaussian kernel matrix from 1,000 points 
on a Swiss roll (Fig. 0 and on a fish bowl (Fig. 0. 
The fish bowl dataset is a punctured sphere proposed 
in (26], which is sparsely sampled at the bottom and 
densely at the top, as shown in the supplementary ma¬ 
terial. We select subsets varying in size between 25 
and 100 and set the parameters a = 1 and m = 30 
for the Swiss roll and <7 = 1 and m = 150 for the 
fish bowl. Note that a further improvement can be 
achieved by adapting these parameters to the size of 
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Data 

Sampling 

25 

50 

60 

70 

80 

90 

100 


Uniform 

70.384 

8.006 

4.838 

2.785 

1.676 

0.731 

0.442 


K-means Uniform 

28.124 

3.848 

2.319 

1.393 

0.756 

0.403 

0.235 

Swiss Roll 

K-means++ Seeding 

50.114 

5.832 

3.033 

1.655 

1.013 

0.683 

0.347 


K-means-|—(- 

24.954 

3.575 

1.915 

1.018 

0.711 

0.383 

0.222 


Efficient DPP 

33.036 

3.371 

1.466 

0.844 

0.488 

0.312 

0.202 


Uniform 

44.026 

8.394 

6.512 

4.678 

1.025 

0.935 

0.758 


K-means Uniform 

12.627 

1.230 

0.612 

0.393 

0.192 

0.119 

0.080 

Fish Bowl 

K-means-f—(- Seeding 

22.841 

2.337 

1.436 

0.643 

0.252 

0.089 

0.045 


K-means+-1- 

11.503 

0.859 

0.402 

0.150 

0.096 

0.039 

0.021 


Efficient DPP 

10.846 

0.657 

0.249 

0.095 

0.014 

0.005 

0.002 


Table 1: Average reconstruction errors over 50 runs for several sampling schemes: uniform, K-means uniform, 
K-means++ seeding, K-means++, and efficient DPP. The subset sizes vary from 25 to 100. Best results are 
highlighted in bold face. 



Figure 5: Fish bowl dataset used for experiments. It is 
a punctured sphere, which is sparsely sampled at the 
bottom and densely at the top. Uniform sampling is 
likely to pick points at the top, neglecting the bottom 
part. 


the subset. For smaller subsets, larger a and m lead 
to improvements. We show the average reconstruction 
error for the different methods and datasets in Table[l] 
calculated over 50 different runs. Generally, the perfor¬ 
mance of uniform sampling is worst. The K-means++ 
seeding yields better results. K-means improves the 
results on both initializations, where K-meansH—I- ben¬ 
efits from the better seeding. The diverse set of land¬ 
marks selected with efficient DPP sampling leads to 
the lowest average reconstruction errors for almost all 
settings. 

In our third experiment, we perform manifold learning 
with Laplacian eigennraps on a point set consisting of 
10 million points lying on a Swiss roll. The dataset 
is too large to apply manifold learning directly. We 
select 2,500 landmark points with the efficient DPP 
sampling algorithm and estimate the local covariance 
matrices, which we feed into the manifold learning al¬ 


gorithm. We compare the the graph construction with 
Euclidean and Bhattacharyya neighborhood selection. 
The graph construction yields a weight matrix W and 
the corresponding degree matrix Djj = ]TL Wij. The 
weight matrix W is a sparse version of the kernel ma¬ 
trix Kj x j , where the sparsity is controlled by the num¬ 
ber of nearest neighbors in the graph. The generalized 
eigenvalue problem solved in Laplacian eigenmaps is 

(V - W)<t>j = (8) 

with eigenvalues A j and eigenvectors <f>j. The l eigen¬ 
vectors corresponding to the l smallest non-zero eigen¬ 
values A jj = A j,j = 1 ,...l constitute the embed¬ 
ding in Z-dimensional space, <E>j = [cj>i, ..., </>;]. Fig. [6] 
shows the low-dimensional embedding in 2D for Eu¬ 
clidean (first row) and Bhattacharyya neighborhood 
selection (second row). We vary the number of near¬ 
est neighbors in the graph from 25 to 500. The re¬ 
sults show that the Bhattacharyya based neighbor¬ 
hood selection is much more robust with respect to 
the number of neighbors. Moreover, we notice that for 
Euclidean neighborhood selection the points seem to 
cluster along stripes. While we observe this effect also 
on the Bhattacharyya based embeddings, it is much 
less pronounced. 

4.1 Image Data 

After having evaluated each of the steps of the pro¬ 
posed approach separately, we now present results for 
scalable manifold learning on image data. We work 
with two datasets, one consisting of handwritten dig¬ 
its and a second one consisting of patches extracted 
from 3D medical images. Each dataset is too large 
to apply manifold learning directly. Consequently, we 
select landmarks with the discussed method, perform 
manifold learning on the landmarks with the Bhat¬ 
tacharyya distance, and use the Nystrom method to 
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(a) 25 (b) 60 



(c) 200 (d) 500 



Figure 6 : Selection of 2,500 landmarks from a set of 10 million points. Embedding of landmarks to 2D with 
Laplacian eigenmaps. Results for Euclidean (first row) and Bhattacharyya neighborhood selection (second row) 
are shown. Varying the number of nearest neighbors between 25 and 500, we observe that Bhattacharyya based 
embeddings are more robust to the parameter setting and of higher quality. 


embed the entire point set. We only consider the di¬ 
agonal entries of the covariance matrices due to space 
limitations. Given the low-dimensional embedding of 
the landmarks $7 £ R. kxl and the diagonal matrix of 
eigenvalues A £ R* xZ , we use out-of-sample extension 
with the Nystrom method to calculate the embedding 
of the remaining points 


*J = K JXJ 


K id = 


K, 


h3 




(9) 

,i£j,j£j ( 10 ) 


where K is the normalized kernel for Laplacian eigen¬ 
maps and the expectation is calculated over landmark 
points [7j. 

Since it is difficult to quantify the quality of the em¬ 
bedding, we use the labels associated to the image 
data to perform nearest neighbor classification in the 
low dimensional space. We expect advantages for the 
DPP landmark selection scheme because a diverse set 
of landmarks spreads the entire point set in the embed¬ 
ding space and helps the classification. For the same 
reason, we also expect a good performance for the K- 
means+H- seeding algorithm. Note that we abstain 
from pre-processing the data and from applying more 
advanced classifiers because we are not interested in 
the absolute classification performance but only in the 
relative performances across the different landmark se¬ 
lection methods. 


4.1.1 MNIST 

We work with the MNIST dataset |21j, consisting of 
60,000 binary images of handwritten digits for train¬ 
ing and 10,000 for testing, with a resolution of 28 x 28 
pixels. We set the neighborhood size to m = 5000 and 
cr = 5. We embed the images into 100 dimensional 
space with Laplacian eigenmaps. Fig. |7(a)| shows the 
statistical analysis over 20 repetitions for several land¬ 
mark selection schemes across different numbers of 
landmarks fc, as well as the Bhattacharyya based graph 
construction. The results show that the K-means++ 
seeding outperforms the uniform initialization, where 
K-meansH—b cannot further improve the initialization. 
Moreover, we observe a significant improvement in 
classification performance for approximate DPP sam¬ 
pling compared to K-Means-f—(- seeding. Finally, the 
Bhattacharyya based graph construction further im¬ 
proves the results. 

4.1.2 Head and Neck 

In a second classification experiment, we use 3D CT 
scans from the head and neck region, having a reso¬ 
lution of 512 x 512 x 145 voxels. These images were 
acquired for radiation therapy of patients with head 
and neck tumors. Fig. [ 8 ] shows one cross sectional 
slice with segmentations of three structures of risk: left 
parotid (green), right parotid (blue), and brainstem 
(green). The segmentation of these structures during 
treatment planning is of high clinical importance to 
ensure that they obtain a low radiation dose. We ex- 
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Figure 7: Classification results for MNIST and head and neck data. Comparison of different subset selection 
schemes (uniform, K-means with uniform seeding, K-means++ seeding, K-means++, efficient DPP, efficient DPP 
with Bhattacliaryya based graph construction) for varying numbers of selected landmarks. Bars indicate mean 
classification performance and error bars correspond to standard deviation. *, **, and *** indicate significance 
levels at 0.05, 0.01, and 0.001, respectively. 



Figure 8: Cross sectional head and neck slice. Overlaid 
are the perimeters of left parotid (green), right parotid 
(blue), and brainstem (green). 


tract image patches from the left and right parotid 
glands, the brainstem and the surrounding region. We 
are interested in classifying patches into these four 
groups, where the outcome can readily serve in seg¬ 
mentation algorithms. We work with patches of size 
7 x 7 x 3 to reflect the physical resolution of the data 
which is 0.98 x 0.98 x 2.5 mm 3 . This results in roughly 
150,000 patches extracted from three scans. 80,000 
patches are used for training and the remaining ones 
for testing. We extract landmarks from the train¬ 
ing patches (m = 5000, a = 5) and embed them 
into 50 dimensional space with Laplacian eigenmaps. 
Fig. |7(b)] shows the statistical analysis of the classifica¬ 
tion performance over 20 repetitions for various num¬ 
bers of selected landmarks and selection schemes. K- 


means-l —V seeding outperforms uniform sampling. K- 
means clearly improves the uniform initialization. K- 
means-l—b shows a similar performance to the initial 
seeding. Similar to the experiments for MNIST, the 
efficient DPP approximation leads to significantly bet¬ 
ter classification results, with an additional gain for the 
Bhattacharyya based graph construction. In addition 
to the significant improvement, our runtime measure¬ 
ments showed that our unoptimized Matlab code for 
efficient DPP sampling runs approximately 15% faster 
than an optimized MEX version of K-means. 

5 Conclusion 

We have presented contributions for two crucial is¬ 
sues of scalable manifold learning: (i) efficient sam¬ 
pling of diverse subsets from manifolds and (ii) ro¬ 
bust graph construction on sparsely sampled mani¬ 
folds. Precisely, we analyzed the sampling from de- 
terminantal distributions on non-Euclidean spaces and 
proposed an efficient approximation of DPP sampling. 
The algorithm is well suited for landmark selection on 
manifolds because probability updates are locally re¬ 
stricted. Further, we proposed the local covariance es¬ 
timation around landmarks to capture the local char¬ 
acteristics of the space. This enabled a more robust 
graph construction with the Bhattacharyya distance 
and yielded low dimensional embeddings of higher 
quality. We compared to state-of-the-art subset selec¬ 
tion procedures and obtained significantly better re¬ 
sults with the proposed algorithm. 
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